[PATCH] Fix two out-of-bounds read issues when handling truncated UTF-8 input (#1005)
author frankslin <frankslin@users.noreply.github.com>
Tue, 13 Jan 2026 00:51:38 +0000 (16:51 -0800)
committer Boyuan Yang <byang@debian.org>
Wed, 14 Jan 2026 00:17:36 +0000 (19:17 -0500)
commit 9c6f59ab824591c0033f98ae693307e8ae1b1ca9
tree e840679a52482a9f54aee7d014519229f7ea59d3
parent 2052e9b7d672357c254ce0088110e2ec67d2f0ed

Two independent out-of-bounds read issues were identified in OpenCC's UTF-8
processing logic when handling malformed or truncated UTF-8 sequences.

1) MaxMatchSegmentation:
   NextCharLength() could return a value larger than the remaining input size.
   The previous logic subtracted this value from a size_t length counter,
   potentially causing underflow and subsequent out-of-bounds reads.

2) Conversion:
   Similar unclamped length handling could allow reads past the end of the
   input buffer during dictionary matching, potentially copying unintended
   bytes into the conversion output.
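
The underflow described in 1) can be reproduced in isolation. The sketch
below is not OpenCC code: LengthFromLeadByte and UncheckedRemaining are
hypothetical stand-ins that assume the character length is derived from the
lead byte alone, which is one way a length helper can report more bytes than
the buffer still holds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not OpenCC's NextCharLength): the sequence length
// implied by the lead byte, without checking that the bytes are present.
std::size_t LengthFromLeadByte(unsigned char lead) {
  if (lead < 0x80) return 1;            // ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                             // invalid lead byte, step one byte
}

// Unchecked bookkeeping in the style the commit message describes:
// the character length is subtracted from a size_t counter as-is.
std::size_t UncheckedRemaining(const char* p, std::size_t remaining) {
  assert(remaining > 0);
  const std::size_t charLen =
      LengthFromLeadByte(static_cast<unsigned char>(*p));
  // Unsigned arithmetic wraps: if charLen > remaining, the result is a
  // huge value instead of a negative one, so a loop guarded by
  // "remaining > 0" keeps reading.
  return remaining - charLen;
}
```

With a buffer holding only the lead byte 0xE4 (the start of a 3-byte
sequence), the counter goes from 1 to SIZE_MAX - 1 rather than to 0.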

This patch fixes both issues by:
- Explicitly tracking the end of the input buffer
- Recomputing remaining length on each iteration
- Clamping matched character and key lengths to the remaining buffer size
- Preventing reads past the null terminator
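
The steps above can be sketched as a bounded loop. This is a minimal
illustration under assumptions, not the upstream change: SplitClamped and
Utf8LeadLength are hypothetical names, and the input is assumed to be a
NUL-terminated buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: sequence length implied by the lead byte alone.
std::size_t Utf8LeadLength(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1;  // invalid lead byte, step one byte
}

// Splits a NUL-terminated buffer into per-character chunks without
// reading past the terminator, even if the last sequence is truncated.
std::vector<std::string> SplitClamped(const char* text) {
  std::vector<std::string> chunks;
  const char* const end = text + std::strlen(text);  // explicit buffer end
  for (const char* p = text; p < end;) {
    // Recompute the remaining length from pointers on every iteration
    // instead of decrementing a counter that could wrap around.
    const std::size_t remaining = static_cast<std::size_t>(end - p);
    // Clamp the claimed character length to what is actually left.
    const std::size_t len =
        std::min(Utf8LeadLength(static_cast<unsigned char>(*p)), remaining);
    assert(len >= 1 && len <= remaining);
    chunks.emplace_back(p, len);
    p += len;
  }
  return chunks;
}
```

Because the length is clamped before use, a truncated trailing sequence is
emitted as a short chunk and the loop ends exactly at the buffer end, while
complete sequences are split the same way as without the clamp.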

The changes preserve existing behavior for valid UTF-8 input and add test
coverage for truncated UTF-8 sequences.

These issues may have security implications when processing untrusted input
and are classified as heap out-of-bounds reads (CWE-125).

Co-authored-by: Claude <noreply@anthropic.com>
Applied-Upstream: https://github.com/BYVoid/OpenCC/commit/345c9a50ab07018f1b4439776bad78a0d40778ec

Gbp-Pq: Topic backport
Gbp-Pq: Name 345c9a50ab07018f1b4439776bad78a0d40778ec.patch
src/Conversion.cpp
src/ConversionTest.cpp
src/MaxMatchSegmentation.cpp
src/MaxMatchSegmentationTest.cpp